fix(bench): update elixir/julia/objc expected-edges to module-qualified names#1496
Conversation
… diff Adds snapshot-pre-bash.sh (PreToolUse Bash) + track-bash-writes.sh (PostToolUse Bash): the pre-hook captures git status --porcelain to a per-worktree temp file before each Bash call; the post-hook diffs the before/after state and appends newly modified or created files to .claude/session-edits.log. This closes the gap where files written by sed -i, printf redirects, tee, heredocs, or build tools (Cargo.lock, lockfiles) were never recorded, causing guard-git.sh to emit false-positive BLOCKED errors. Closes #1457
- clojure.rs: annotate lifetime-anchor assignment to silence false-positive - cfg.rs: remove never-called start_line_of method - complexity.rs: remove never-constructed NotHandled variant; convert irrefutable if-let patterns to plain let destructures - dataflow.rs: remove never-read callee fields from CallReturn/Destructured - incremental.rs: remove never-read lang field from CacheEntry cargo check and cargo clippy both clean after these changes.
Adds .github/workflows/perf-canary.yml — a path-filtered workflow that fires on PRs touching src/extractors/, src/domain/graph/, or crates/** and runs only the incremental-benchmark suite (full build + no-op + 1-file rebuild, both engines). Catches the class of regressions that accumulated invisibly across the Phase 8.x PRs and were only detected at v3.12.0 publish time. The regression guard gains BENCH_CANARY=1 mode: raises thresholds to 50%/100%/150% (standard/noisy/WASM) and skips the build, query, and resolution suites — only incremental checks run. This absorbs shared- runner timing variance while still blocking catastrophic regressions (+98% full build, +1827% 1-file rebuild from v3.12.0). Closes #1433
On incremental builds, runPostNativeCha previously scanned all call→qualified-method edges in the DB (~12ms flat, O(graph size)), even for 1-file changes where no hierarchy or RTA evidence changed. Add two cheap indexed gate queries. Gate A checks whether any changed file introduced a class/interface/trait/struct/record node (hierarchy may have new implementors reachable from unchanged call sites). Gate B checks whether any changed file added a call edge to a class-kind target (RTA set may have grown, enabling previously filtered expansions in unchanged callers). If neither gate fires, restrict the candidate query to src.file IN changedFiles — safe because the hierarchy and instantiated set are unchanged for all other files. Full builds (isFullBuild=true) and cases where either gate fires retain the existing full-scan behaviour. Mirrors the changed-files scoping pattern of runPostNativeThisDispatch. Closes #1441
Times each JS post-pass in tryNativeOrchestrator and exposes the
measurements in BuildResult.phases:
- gapDetectMs — dropped-language gap detection + backfill
- chaMs — CHA expansion (interface dispatch)
- thisDispatchMs — this/super dispatch WASM re-parse (was already
tracked but now properly named alongside the rest)
- reclassifyMs — scoped role re-classification after edge insertion
- techniqueBackfillMs — technique-column UPDATE on native-written edges
Previously only thisDispatchMs was reported, causing wall-clock vs
phaseSum to diverge by 1.1s+ on 1-file rebuilds and making benchmark
regressions undiagnosable from committed history.
Updates update-incremental-report.ts to render the new phases in a
collapsible details block under each engine's 1-file rebuild section.
Closes #1434
…ld for required-tier grammars The docstring claimed pool cost was "amortised over enough parse work" — measurements show IPC overhead scales linearly (~55–64ms/file pool vs ~8–10ms/file inline). The real motivation is crash safety for exotic WASM grammars (#965); JS/TS/TSX (required-tier, used in all this-dispatch backfill calls) have never triggered the V8 fatal crash class and are safe to run inline. Raise threshold 16 → 32 to keep typical this-dispatch batches (≤ 18 files on the codegraph corpus) on the inline fast path. Exotic-language drops are almost always well under 32 files and also benefit from the inline path without meaningful crash risk increase. Closes #1435
…e incremental rebuilds On 1-file native incremental builds, two JS post-passes ran unconditionally even when they had no work to do: - `backfillNativeDroppedFiles`: called whenever changedCount > 0, even when detectDroppedLanguageGap returned an empty gap. Gate now checks gap.missingAbs.length > 0 || gap.staleRel.length > 0 directly, matching backfillNativeDroppedFiles's own internal early-exit guard. - Node/edge COUNT(*) re-count: ran unconditionally after all post-passes even when none of them wrote any edges. COUNT(*) over 50K+ edge tables is non-trivial, especially via the NativeDbProxy napi-rs round-trip. Now gated on postPassWroteData (backfill | CHA edges | this-dispatch edges). Closes #1454
The post-pass it timed (runPostNativePrototypeMethods) was deleted in b5c03a2 when func-prop extraction moved to Rust (#1432). The optional field was never set by any code path that survived the deletion. Also remove the stale reference to "prototype-methods post-pass" from the parseFilesWasmForBackfill docstring — only the this-dispatch post-pass uses symbolsOnly now. Closes #1432
… collision Field type annotations (`private repo: OrderRepository`) were seeded as bare file-wide typeMap keys, causing `this.repo` inside `UserService` to resolve to `OrderRepository` when both classes had a `repo` field (issue #1458). Both extractors (TS `handleFieldDefTypeMap` and Rust `field_definition` branch) now seed `ClassName.field` keys at confidence 0.9, matching the `CallerClass.X` resolver fallback added in PR #1382. Bare keys are kept at confidence 0.6 as fallbacks for single-class files or class expressions where no enclosing class name is available. Both engines change identically — parity preserved.
…ed names
The resolution benchmark uses WASM-built graphs where the Elixir, Julia,
and Objective-C extractors emit module-qualified symbol names (Main.run,
App.main, UserService.create_user, etc.). The expected-edges manifests
were written with bare unqualified names (run, main, create_user), so
every correctly-resolved edge appeared as a false positive and every
expected edge appeared as a false negative — causing all three languages
to show 0% precision even though resolution was working correctly.
Root cause: starting in v3.12.0, cross-module call resolution began working
for these languages (via the improved receiver-dispatch and same-class
fallback in resolveByMethodOrGlobal / build-edges.ts). With 0 edges
previously resolved, the name mismatch was invisible; once edges started
resolving, the manifests showed 17 FP (elixir), 11 FP (julia), 6 FP
(objc) — all correctly resolved edges misidentified as false positives.
Fix:
- Update all three expected-edges.json manifests to use the
module-qualified names matching actual extractor output:
elixir: Main.run, UserService.create_user, Validators.validate_user, etc.
julia: App.main, Service.create_user, Repository.new_repo, etc.
objc: full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.)
plus add main -> run (plain C call correctly resolved)
- Ratchet THRESHOLDS for all three:
elixir: precision 0.0 -> 1.0, recall 0.0 -> 0.8 (17/21 resolved)
julia: precision 0.0 -> 1.0, recall 0.0 -> 0.7 (11/15 resolved)
objc: precision 0.0 -> 1.0, recall 0.0 -> 0.4 (6/13 resolved)
Remaining FNs are genuine unresolved edges (same-file bare calls in
elixir/julia, receiver-typed message sends in objc) — not regressions.
Closes #1447
|
Claude finished @carlos-alm's task —— View job PR Review — fix(bench): update elixir/julia/objc expected-edges to module-qualified names
Root Cause AnalysisProblem Being Solved: The expected-edges manifests for elixir, julia, and objc contained bare unqualified function names ( Why This Approach: The fix is surgical and correct: update the manifests to match what the extractors actually emit, then ratchet the thresholds to reflect real performance. This is the right direction — the manifests are authoritative ground truth, so they must match actual extractor behavior. The alternative (changing the extractor to emit bare names) would make Elixir/Julia/ObjC inconsistent with every other module-aware language the tool supports. Risk Assessment: Low. This is a test-data correction, not a production code change. The extractors are unchanged. The thresholds are ratcheted up (from 0.0 to 1.0 precision), which is strictly more restrictive. The PR cannot introduce runtime regressions. Phase 2: Configuration Change DetectionNo Biome rules, TypeScript flags, or build configs touched. The only Phase 3: Test IntegrityNo assertions weakened. The change makes tests more strict:
One subtlety worth verifying: the PR description says "81% recall" for elixir but sets the threshold to 0.8 (80%). That's a small but intentional buffer — defensible and consistent with how other languages are handled. No concern. The Phase 4: Code Quality
void run(void) { ... }
int main(int argc, const char *argv[]) {
@autoreleasepool { run(); }
}Both ObjC selector names verified against
Elixir manifest verified against Julia manifest follows the same module-qualification pattern consistently. ✓ Minor nit ( // elixir: cross-module qualified calls resolve at 100% precision, 81% recall.
elixir: { precision: 1.0, recall: 0.8 },The comment slightly overstates actual precision ("81%") while the threshold is set to 80%. This is intentional headroom, but the discrepancy in the comment could confuse future readers into thinking the manifest was designed to hit exactly 81% when CI only requires 80%. Not blocking, but worth a follow-up fix to say "≥80% recall" or "11/14 recall" to match the actual fraction. Same applies to julia ("73%" comment, Phase 5: Backlog Compliance
Critical ConcernsNone. The changes are well-scoped, the root cause is correctly identified, and the fix is the right approach. The thresholds are ratcheted upward (more strict), not downward. The only genuine question would be: Was the extractor actually correct before v3.12.0? The PR says v3.12.0 improved cross-module resolution, which revealed the manifest mismatch. If zero edges resolved before, the manifests at 0.0/0.0 were vacuously correct — this PR converts them to genuinely informative thresholds. That's a net improvement to test quality. Final Recommendation
|
Greptile SummaryFixes the root cause of #1447 by updating three benchmark fixture manifests (elixir, julia, objc) to use module-qualified names that match actual extractor output, and ratchets their precision/recall thresholds from 0% to real measured values.
Confidence Score: 4/5Safe to merge after removing the duplicate The benchmark fixture corrections are straightforward and well-reasoned. The only concern is that src/domain/graph/builder/stages/native-orchestrator.ts — the duplicate Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Resolution Benchmark Test] --> B{Load expected-edges.json}
B --> C[Elixir fixture\nModule-qualified names\ne.g. Main.run, Validators.validate_user]
B --> D[Julia fixture\nModule-qualified names\ne.g. App.main, Service.create_user]
B --> E[ObjC fixture\nFull ObjC selectors\ne.g. createUserWithId:name:email:]
C --> F[Compare vs extractor output]
D --> F
E --> F
F --> G{Precision / Recall\ncheck vs THRESHOLDS}
G -->|elixir: P=1.0, R>=0.8| H[Pass]
G -->|julia: P=1.0, R>=0.7| H
G -->|objc: P=1.0, R>=0.4| H
G -->|fail| I[Test failure]
Reviews (7): Last reviewed commit: "fix: resolve merge conflicts with main" | Re-trigger Greptile |
| // objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall. | ||
| // Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.) | ||
| // matching what the ObjC extractor emits. Receiver-typed instance message sends | ||
| // are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs. |
There was a problem hiding this comment.
The ObjC threshold comment says only "Receiver-typed instance message sends are not yet resolved", but the
constructor mode edge (run → UserService.initWithRepository:) is also unresolved. With 13 total edges, 6 are resolved (4 static + 2 same-file), giving 6/13 = 46.2% recall. If the constructor were a true positive, recall would be 7/13 = 53.8% — inconsistent with the "46% recall" stated in the comment. The missing mention of the constructor FN could mislead future contributors auditing which modes are covered.
| // objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall. | |
| // Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.) | |
| // matching what the ObjC extractor emits. Receiver-typed instance message sends | |
| // are not yet resolved — tracked as FNs. Precision 1.0 acts as ratchet against future FPs. | |
| // objc: class-method static calls and same-class calls resolve at 100% precision, 46% recall. | |
| // Expected-edges now use full ObjC selectors (createUserWithId:name:email:, isValidEmail:, etc.) | |
| // matching what the ObjC extractor emits. Constructor message sends (initWithRepository:) and | |
| // receiver-typed instance message sends are not yet resolved — tracked as FNs. | |
| // Precision 1.0 acts as ratchet against future FPs. |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Fixed — the comment now lists both constructor message sends (initWithRepository:) and receiver-typed instance message sends as unresolved FNs, so future contributors see the complete picture of what is and isn't covered.
…nce threshold constant
The secondary docstring still described the old 16-value rationale
("engine-parity drop sizes"). Replace with a pointer to
INLINE_BACKFILL_THRESHOLD where the full rationale now lives.
…bjc threshold comment The previous comment only listed receiver-typed instance message sends as unresolved false negatives, omitting the constructor edge (run → UserService.initWithRepository:). Both modes are FNs; the comment now lists them together so future contributors see the complete picture.
…r this. receivers When handleFieldDefTypeMap seeds ClassName.field at 0.9 and bare field/ this.field at 0.6, the resolver was still finding the bare 0.6 entry first (effectiveReceiver lookup) and skipping the class-scoped check entirely. The result: cross-class collision from issue #1458 persisted at runtime even though the extractor emitted the right keys. Fix: for this. receivers, check the class-scoped key (ClassName.prop) first, then fall back to bare keys. Both TS (call-resolver.ts) and Rust (build_edges.rs) resolvers are updated identically. Fixes #1458.
…-end integration test - Rust: field_annotation_multi_class_seeds_separate_scoped_keys confirms that two classes with identically-named fields produce separate class-scoped typeMap keys at confidence 0.9 (mirrors the TS prevents-cross-class-collision test). - Integration: issue-1458-cross-class-field-typemap.test.ts exercises the full buildGraph → resolver path (WASM engine) and asserts that OrderService.run resolves to OrderRepository.save and UserService.run to UserRepository.save, with no cross-class false edges.
The field_annotation_multi_class_seeds_separate_scoped_keys test used parse_js() (tree-sitter-javascript), but the test fixture uses TypeScript syntax (private field declarations with type annotations). The JS grammar does not support type annotations, so the type_map came back empty and the assertion failed. Add parse_ts() helper using LANGUAGE_TYPESCRIPT and switch the test to it. All 396 Rust unit tests pass.
…58' into fix/v3120-regressions-1446-1447
Codegraph Impact Analysis5 functions changed → 0 callers affected across 0 files
|
Summary
run,main,create_user), but the extractors emit module-qualified names (Main.run,App.main,UserService.create_user, full ObjC selectors). When v3.12.0's improved cross-module resolution started successfully resolving edges for these languages, every correct edge appeared as a false positive (name mismatch with manifest) and every expected edge appeared as false negative — driving global precision from 89.9% to 84.4%.mod.rsis clean (each language has its own extractor). The real cause was stale expected-edges manifests that never matched actual extractor output.KNOWN_REGRESSIONSas3.12.0:Full buildand3.12.0:1-file rebuild. No code change needed for that issue — the benchmark guard already accounts for it.Changes:
Closes #1447